We have received data for seven different types of dry beans. To develop a supervised machine learning model, we need to take into account features of these beans such as form, shape, type, and structure, as classified by the market, which will in turn help us distinguish bean varieties that have high feature similarity. The dataset consists of 13,611 grains of 7 different types, described by a total of 16 features: 12 dimensions and 4 shape forms.
The aim of this project is to develop a supervised machine learning algorithm that performs multi-class classification of dry bean species harvested from population cultivation on a single farm.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler, RobustScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge,Lasso,RidgeCV, LassoCV, ElasticNet, ElasticNetCV, LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score, classification_report, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score,KFold,StratifiedKFold,LeaveOneOut
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
df_root = pd.read_excel("Dry_Bean_Dataset.xlsx")
df_root.head(2)
| | Bean ID | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | Solidity | roundness | Compactness | ShapeFactor1 | ShapeFactor2 | ShapeFactor3 | ShapeFactor4 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 28395.0 | 610.291 | 208.178117 | 173.888747 | 1.197191 | 0.549812 | 28715.0 | 190.141097 | 0.763923 | 0.988856 | 0.958027 | 0.913358 | 0.007332 | 0.003147 | 0.834222 | 0.998724 | SEKER |
| 1 | 2.0 | 28734.0 | 638.018 | 200.524796 | 182.734419 | 1.097356 | 0.411785 | 29172.0 | 191.272750 | 0.783968 | 0.984986 | 0.887034 | 0.953861 | 0.006979 | 0.003564 | 0.909851 | 0.998430 | SEKER |
Exploratory Data Analysis (EDA) is an approach to analyze the data using visual techniques. It is used to discover trends, patterns, or to check assumptions with the help of statistical summary and graphical representations.
1. All the independent columns are continuous, and the target (dependent column) is a categorical variable.
df_root.describe()
| | Bean ID | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | Solidity | roundness | Compactness | ShapeFactor1 | ShapeFactor2 | ShapeFactor3 | ShapeFactor4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13611.000000 | 13611.000000 | 13611.000000 | 13609.000000 | 13608.000000 | 13610.000000 | 13609.000000 | 13611.000000 | 13611.000000 | 13611.000000 | 13608.000000 | 13611.000000 | 13607.000000 | 13607.000000 | 13609.000000 | 13611.000000 | 13611.000000 |
| mean | 6806.000000 | 53048.284549 | 855.283459 | 320.155372 | 202.273996 | 1.583268 | 0.750930 | 53768.200206 | 253.064220 | 0.749733 | 0.987142 | 0.873282 | 0.799830 | 0.006564 | 0.001716 | 0.643590 | 0.995063 |
| std | 3929.301592 | 29324.095717 | 214.289696 | 85.693199 | 44.974478 | 0.246669 | 0.091962 | 29774.915817 | 59.177120 | 0.049086 | 0.004661 | 0.059520 | 0.061691 | 0.001128 | 0.000596 | 0.098996 | 0.004366 |
| min | 1.000000 | 20420.000000 | 524.736000 | 183.601165 | 122.512653 | 1.024868 | 0.218951 | 20684.000000 | 161.243764 | 0.555315 | 0.919246 | 0.489618 | 0.640577 | 0.002778 | 0.000564 | 0.410339 | 0.947687 |
| 25% | 3403.500000 | 36328.000000 | 703.523500 | 253.319280 | 175.840519 | 1.432352 | 0.715953 | 36714.500000 | 215.068003 | 0.718634 | 0.985670 | 0.832096 | 0.762444 | 0.005900 | 0.001153 | 0.581359 | 0.993703 |
| 50% | 6806.000000 | 44652.000000 | 794.941000 | 296.899313 | 192.443880 | 1.551132 | 0.764446 | 45178.000000 | 238.438026 | 0.759859 | 0.988283 | 0.883157 | 0.801227 | 0.006645 | 0.001694 | 0.642044 | 0.996386 |
| 75% | 10208.500000 | 61332.000000 | 977.213000 | 376.497678 | 217.036082 | 1.707118 | 0.810471 | 62294.000000 | 279.446467 | 0.786851 | 0.990013 | 0.916869 | 0.834234 | 0.007272 | 0.002170 | 0.696006 | 0.997883 |
| max | 13611.000000 | 254616.000000 | 1985.370000 | 738.860153 | 460.198497 | 2.430306 | 0.911423 | 263261.000000 | 569.374358 | 0.866195 | 0.994677 | 0.990685 | 0.987303 | 0.010451 | 0.003665 | 0.974767 | 0.999733 |
df_root.dtypes
Bean ID            float64
Area               float64
Perimeter          float64
MajorAxisLength    float64
MinorAxisLength    float64
AspectRation       float64
Eccentricity       float64
ConvexArea         float64
EquivDiameter      float64
Extent             float64
Solidity           float64
roundness          float64
Compactness        float64
ShapeFactor1       float64
ShapeFactor2       float64
ShapeFactor3       float64
ShapeFactor4       float64
Class               object
dtype: object
This is a classification problem with continuous inputs and a categorical target.
There are a total of 7 classes in the target feature.
df_missing = df_root.copy()
set(df_missing.Class)
{'BARBUNYA', 'BOMBAY', 'CALI', 'DERMASON', 'HOROZ', 'SEKER', 'SIRA'}
Checking the balance of the data with respect to the target feature
As the target feature has 7 classes, we check whether the dataset is balanced or imbalanced.
# Series.value_counts sorts by frequency by default; the top-level pd.value_counts is deprecated in recent pandas
count_class = df_root["Class"].value_counts()
count_class.plot(kind='bar',rot = 0)
It is a multiclass classification problem, with the BOMBAY class having the fewest samples.
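As a rough illustration of how class balance can be quantified instead of only read off a bar chart, the sketch below computes per-class counts, relative shares, and a largest-to-smallest ratio. The toy labels and the name `imbalance_ratio` are assumptions made for this example; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy labels standing in for the real "Class" column.
labels = pd.Series(["SEKER", "SEKER", "SIRA", "SIRA", "SIRA", "BOMBAY"], name="Class")

# Absolute number of samples per class, largest first.
counts = labels.value_counts()

# Relative share of each class (fractions that sum to 1).
shares = labels.value_counts(normalize=True)

# Ratio between the largest and the smallest class;
# values far above 1 point to an imbalanced target.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(3))
print("imbalance ratio:", imbalance_ratio)
```

On the real data, the same calls could be applied to `df_root["Class"]`.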
Looking for duplicate rows
The number of duplicated rows in the dataset is zero, as the empty result below shows.
df_root[df_root.duplicated()]
| | Bean ID | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | Solidity | roundness | Compactness | ShapeFactor1 | ShapeFactor2 | ShapeFactor3 | ShapeFactor4 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
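The duplicate check can also be reduced to a single number. The following is a minimal sketch on a made-up frame (the values are illustrative assumptions), showing how duplicated rows could be counted and, if any were found, removed.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first one exactly.
toy = pd.DataFrame({
    "Area": [10.0, 12.0, 10.0],
    "Perimeter": [4.0, 5.0, 4.0],
    "Class": ["SEKER", "SIRA", "SEKER"],
})

# duplicated() marks every row that is identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print("duplicated rows:", n_duplicates)
print(deduplicated)
```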
Correlation analysis is a statistical method used to measure the strength of the linear relationship between two variables and compute their association.
Correlation analysis quantifies how strongly one variable tends to change as the other changes; it does not by itself show that one change causes the other.
A high correlation points to a strong relationship between the two variables, while a low correlation means that the variables are weakly related.
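To make the definition concrete, here is a small sketch of the Pearson correlation coefficient computed by hand and checked against `np.corrcoef`. The helper name `pearson_r` and the sample arrays are assumptions for illustration only.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2.0 * x + 1.0        # perfectly increasing linear relation
y_neg = -0.5 * x + 3.0       # perfectly decreasing linear relation
y_noisy = np.array([2.0, 1.0, 4.0, 3.0, 5.0])  # increasing trend with noise

print(pearson_r(x, y_pos))
print(pearson_r(x, y_neg))
print(pearson_r(x, y_noisy), np.corrcoef(x, y_noisy)[0, 1])
```

A perfectly linear relation gives a coefficient of +1 or -1, while the noisy series lands strictly between them.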
df_tmp = df_root.drop(columns=['Bean ID'])
# On pandas >= 2.0, pass numeric_only=True so the non-numeric Class column is skipped
df_root.corr()
| | Bean ID | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | Solidity | roundness | Compactness | ShapeFactor1 | ShapeFactor2 | ShapeFactor3 | ShapeFactor4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bean ID | 1.000000 | -0.369273 | -0.411175 | -0.325045 | -0.503044 | 0.139316 | 0.307748 | -0.369615 | -0.418614 | -0.099929 | 0.087452 | 0.115936 | -0.200357 | 0.602935 | 0.074558 | -0.223753 | 0.058103 |
| Area | -0.369273 | 1.000000 | 0.966722 | 0.931836 | 0.951615 | 0.241692 | 0.267351 | 0.999939 | 0.984968 | 0.054345 | -0.196511 | -0.357530 | -0.267840 | -0.847983 | -0.639305 | -0.272145 | -0.355721 |
| Perimeter | -0.411175 | 0.966722 | 1.000000 | 0.977337 | 0.913217 | 0.385228 | 0.390915 | 0.967689 | 0.991380 | -0.021160 | -0.303887 | -0.547647 | -0.406618 | -0.864671 | -0.767606 | -0.408435 | -0.429310 |
| MajorAxisLength | -0.325045 | 0.931836 | 0.977337 | 1.000000 | 0.826119 | 0.550200 | 0.541746 | 0.932609 | 0.961732 | -0.077982 | -0.284203 | -0.596303 | -0.568118 | -0.773699 | -0.859237 | -0.568093 | -0.482477 |
| MinorAxisLength | -0.503044 | 0.951615 | 0.913217 | 0.826119 | 1.000000 | -0.009318 | 0.019272 | 0.951353 | 0.948567 | 0.145987 | -0.155750 | -0.210290 | -0.014734 | -0.947207 | -0.471539 | -0.019175 | -0.263713 |
| AspectRation | 0.139316 | 0.241692 | 0.385228 | 0.550200 | -0.009318 | 1.000000 | 0.924293 | 0.243258 | 0.303601 | -0.370102 | -0.267582 | -0.766946 | -0.987689 | 0.024676 | -0.837749 | -0.978591 | -0.449225 |
| Eccentricity | 0.307748 | 0.267351 | 0.390915 | 0.541746 | 0.019272 | 0.924293 | 1.000000 | 0.269124 | 0.318512 | -0.319300 | -0.297427 | -0.722177 | -0.970308 | 0.020108 | -0.859938 | -0.981057 | -0.449276 |
| ConvexArea | -0.369615 | 0.999939 | 0.967689 | 0.932609 | 0.951353 | 0.243258 | 0.269124 | 1.000000 | 0.985226 | 0.052564 | -0.206118 | -0.362083 | -0.269698 | -0.847975 | -0.640876 | -0.274024 | -0.362049 |
| EquivDiameter | -0.418614 | 0.984968 | 0.991380 | 0.961732 | 0.948567 | 0.303601 | 0.318512 | 0.985226 | 1.000000 | 0.028383 | -0.231566 | -0.435945 | -0.327391 | -0.892783 | -0.713083 | -0.330389 | -0.392512 |
| Extent | -0.099929 | 0.054345 | -0.021160 | -0.077982 | 0.145987 | -0.370102 | -0.319300 | 0.052564 | 0.028383 | 1.000000 | 0.191330 | 0.344411 | 0.354102 | -0.141634 | 0.237880 | 0.347624 | 0.148502 |
| Solidity | 0.087452 | -0.196511 | -0.303887 | -0.284203 | -0.155750 | -0.267582 | -0.297427 | -0.206118 | -0.231566 | 0.191330 | 1.000000 | 0.607115 | 0.304073 | 0.153371 | 0.343437 | 0.307549 | 0.702138 |
| roundness | 0.115936 | -0.357530 | -0.547647 | -0.596303 | -0.210290 | -0.766946 | -0.722177 | -0.362083 | -0.435945 | 0.344411 | 0.607115 | 1.000000 | 0.768069 | 0.230259 | 0.782785 | 0.763126 | 0.472149 |
| Compactness | -0.200357 | -0.267840 | -0.406618 | -0.568118 | -0.014734 | -0.987689 | -0.970308 | -0.269698 | -0.327391 | 0.354102 | 0.304073 | 0.768069 | 1.000000 | -0.009615 | 0.868726 | 0.998686 | 0.484380 |
| ShapeFactor1 | 0.602935 | -0.847983 | -0.864671 | -0.773699 | -0.947207 | 0.024676 | 0.020108 | -0.847975 | -0.892783 | -0.141634 | 0.153371 | 0.230259 | -0.009615 | 1.000000 | 0.469442 | -0.008398 | 0.248608 |
| ShapeFactor2 | 0.074558 | -0.639305 | -0.767606 | -0.859237 | -0.471539 | -0.837749 | -0.859938 | -0.640876 | -0.713083 | 0.237880 | 0.343437 | 0.782785 | 0.868726 | 0.469442 | 1.000000 | 0.872900 | 0.529856 |
| ShapeFactor3 | -0.223753 | -0.272145 | -0.408435 | -0.568093 | -0.019175 | -0.978591 | -0.981057 | -0.274024 | -0.330389 | 0.347624 | 0.307549 | 0.763126 | 0.998686 | -0.008398 | 0.872900 | 1.000000 | 0.484274 |
| ShapeFactor4 | 0.058103 | -0.355721 | -0.429310 | -0.482477 | -0.263713 | -0.449225 | -0.449276 | -0.362049 | -0.392512 | 0.148502 | 0.702138 | 0.472149 | 0.484380 | 0.248608 | 0.529856 | 0.484274 | 1.000000 |
corrmat = df_tmp.corr()
#Correlation plot
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
#plot heat map
g=sns.heatmap(df_tmp[top_corr_features].corr(),annot=True,cmap="RdYlGn")
PairPlot
A pairplot plots the pairwise relationships in a dataset. The pairplot function creates a grid of Axes such that each variable in the data is shared on the y-axis across a single row and on the x-axis across a single column. That creates plots as shown below.
# pairplot creates its own figure, so a separate plt.figure call is not needed
sns.pairplot(data=df_tmp)
It is clearly visible that there is multicollinearity between the features.
Based on the heatmap and the correlation data above, we can say that there is multicollinearity between the independent features in the data.
Multicollinearity: a statistical concept where several independent variables in a model are correlated. Two variables are considered to be perfectly collinear if their correlation coefficient is +/- 1.0. Multicollinearity among independent variables results in less reliable statistical inferences.
Example: Area is very strongly positively correlated with ConvexArea (0.9999) and EquivDiameter (0.985), and negatively correlated with ShapeFactor2 (-0.639).
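Reading strongly correlated pairs off a large heatmap is error-prone, so the sketch below lists them programmatically. The helper `correlated_pairs`, the 0.9 threshold, and the synthetic columns are assumptions chosen for illustration; they are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for every pair with |r| >= threshold, strongest first."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, skips self-pairs
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Synthetic stand-in data: ConvexArea is built to track Area closely, Extent is independent.
rng = np.random.default_rng(0)
area = rng.uniform(20000, 60000, size=200)
toy = pd.DataFrame({
    "Area": area,
    "ConvexArea": area * 1.01 + rng.normal(0, 50, size=200),
    "Extent": rng.uniform(0.6, 0.85, size=200),
})

pairs = correlated_pairs(toy, threshold=0.9)
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:.4f}")
```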
Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in supervised learning. In order to make machine learning work well on new tasks, it might be necessary to design and train better features. As you may know, a “feature” is any measurable input that can be used in a predictive model — it could be the color of an object or the sound of someone’s voice. Feature engineering, in simple terms, is the act of converting raw observations into desired features using statistical or machine learning approaches.
Removing columns that do not contribute to the learning of the model, such as the index column (Bean ID) in this dataset
In total, 20 rows contain at least one missing value (the dataset shrinks from 13,611 to 13,591 rows once they are dropped).
Compared with the total number of rows (13,611), this is very small, so we can drop these rows with minimal information loss.
df_root.drop(columns= ["Bean ID"],inplace=True)
Handling missing or null values in the independent variables
df_root.isna().sum()
Area               0
Perimeter          0
MajorAxisLength    2
MinorAxisLength    3
AspectRation       1
Eccentricity       2
ConvexArea         0
EquivDiameter      0
Extent             0
Solidity           3
roundness          0
Compactness        4
ShapeFactor1       4
ShapeFactor2       2
ShapeFactor3       0
ShapeFactor4       0
Class              0
dtype: int64
The number of rows with null values is small and does not affect the distribution of the dataset, so we can drop these rows.
After dropping the rows, there is no appreciable change in the statistical parameters of the data.
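Counting missing cells and counting affected rows are two different questions, because one row can have several missing values. The sketch below shows both counts on a small made-up frame (the values are illustrative assumptions).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 has two missing cells, row 3 has one.
toy = pd.DataFrame({
    "Area": [1.0, 2.0, 3.0, 4.0],
    "Solidity": [0.9, np.nan, 0.8, np.nan],
    "Compactness": [0.7, np.nan, 0.6, 0.5],
})

# Total number of missing cells across the whole frame.
missing_cells = int(toy.isna().sum().sum())

# Number of rows that contain at least one missing cell;
# this is how many rows dropna() would remove.
rows_with_missing = int(toy.isna().any(axis=1).sum())

cleaned = toy.dropna()

print("missing cells:", missing_cells)
print("rows with a missing value:", rows_with_missing)
print("rows kept:", len(cleaned))
```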
df_root = df_root.dropna()
print(df_root.shape)
df_root.head()
(13591, 17)
| | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | Solidity | roundness | Compactness | ShapeFactor1 | ShapeFactor2 | ShapeFactor3 | ShapeFactor4 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28395.0 | 610.291 | 208.178117 | 173.888747 | 1.197191 | 0.549812 | 28715.0 | 190.141097 | 0.763923 | 0.988856 | 0.958027 | 0.913358 | 0.007332 | 0.003147 | 0.834222 | 0.998724 | SEKER |
| 1 | 28734.0 | 638.018 | 200.524796 | 182.734419 | 1.097356 | 0.411785 | 29172.0 | 191.272750 | 0.783968 | 0.984986 | 0.887034 | 0.953861 | 0.006979 | 0.003564 | 0.909851 | 0.998430 | SEKER |
| 2 | 29380.0 | 624.110 | 212.826130 | 175.931143 | 1.209713 | 0.562727 | 29690.0 | 193.410904 | 0.778113 | 0.989559 | 0.947849 | 0.908774 | 0.007244 | 0.003048 | 0.825871 | 0.999066 | SEKER |
| 4 | 30140.0 | 620.134 | 201.847882 | 190.279279 | 1.060798 | 0.333680 | 30417.0 | 195.896503 | 0.773098 | 0.990893 | 0.984877 | 0.970516 | 0.006697 | 0.003665 | 0.941900 | 0.999166 | SEKER |
| 5 | 30279.0 | 634.927 | 212.560556 | 181.510182 | 1.171067 | 0.520401 | 30600.0 | 196.347702 | 0.775688 | 0.989510 | 0.943852 | 0.923726 | 0.007020 | 0.003153 | 0.853270 | 0.999236 | SEKER |
df_root.isna().sum()
Area               0
Perimeter          0
MajorAxisLength    0
MinorAxisLength    0
AspectRation       0
Eccentricity       0
ConvexArea         0
EquivDiameter      0
Extent             0
Solidity           0
roundness          0
Compactness        0
ShapeFactor1       0
ShapeFactor2       0
ShapeFactor3       0
ShapeFactor4       0
Class              0
dtype: int64
df_root.describe()
| | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | Solidity | roundness | Compactness | ShapeFactor1 | ShapeFactor2 | ShapeFactor3 | ShapeFactor4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13591.000000 | 13591.000000 | 13591.000000 | 13591.000000 | 13591.000000 | 13591.000000 | 13591.000000 | 13591.000000 | 13591.000000 | 13591.000000 | 13591.000000 | 13591.000000 | 13591.000000 | 13591.000000 | 13591.000000 | 13591.000000 |
| mean | 53077.569421 | 855.565981 | 320.281091 | 202.292563 | 1.583801 | 0.751189 | 53797.983298 | 253.134391 | 0.749698 | 0.987141 | 0.873174 | 0.799700 | 0.006563 | 0.001714 | 0.643315 | 0.995059 |
| std | 29335.656114 | 214.319365 | 85.679586 | 44.999344 | 0.246422 | 0.091731 | 29786.623118 | 59.192047 | 0.049109 | 0.004662 | 0.059491 | 0.061607 | 0.001129 | 0.000595 | 0.098800 | 0.004368 |
| min | 20420.000000 | 524.736000 | 183.601165 | 122.512653 | 1.024868 | 0.218951 | 20684.000000 | 161.243764 | 0.555315 | 0.919246 | 0.489618 | 0.640577 | 0.002778 | 0.000564 | 0.410339 | 0.947687 |
| 25% | 36373.000000 | 703.897000 | 253.441135 | 175.792208 | 1.432988 | 0.716251 | 36750.000000 | 215.201166 | 0.718527 | 0.985669 | 0.832015 | 0.762366 | 0.005898 | 0.001153 | 0.581202 | 0.993695 |
| 50% | 44674.000000 | 795.194000 | 297.079966 | 192.476741 | 1.551509 | 0.764576 | 45224.000000 | 238.496758 | 0.759785 | 0.988282 | 0.883061 | 0.801138 | 0.006644 | 0.001692 | 0.641822 | 0.996381 |
| 75% | 61352.000000 | 977.360500 | 376.520088 | 217.124066 | 1.707447 | 0.810550 | 62326.500000 | 279.492026 | 0.786858 | 0.990011 | 0.916797 | 0.834010 | 0.007273 | 0.002168 | 0.695572 | 0.997877 |
| max | 254616.000000 | 1985.370000 | 738.860153 | 460.198497 | 2.430306 | 0.911423 | 263261.000000 | 569.374358 | 0.866195 | 0.994677 | 0.990685 | 0.987303 | 0.010451 | 0.003665 | 0.974767 | 0.999733 |
Target Column is a categorical Column with 7 different classes.
As it is a classification problem we can map each class to a numerical value and replace in the dataset
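# Map each class label to an integer code.
# Note: set iteration order is not guaranteed, so the exact codes can differ between runs.
dict_Class = {k:v for k, v in zip(list(set(df_root.iloc[:, -1].tolist())),
                                  list(range(len(list(set(df_root.iloc[:, -1].tolist()))))))}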
dict_Class
{'DERMASON': 0,
'BOMBAY': 1,
'CALI': 2,
'BARBUNYA': 3,
'HOROZ': 4,
'SIRA': 5,
'SEKER': 6}
df_root["Class"] = df_root["Class"].map(dict_Class)
df_root.head(8)
| | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | Solidity | roundness | Compactness | ShapeFactor1 | ShapeFactor2 | ShapeFactor3 | ShapeFactor4 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28395.0 | 610.291 | 208.178117 | 173.888747 | 1.197191 | 0.549812 | 28715.0 | 190.141097 | 0.763923 | 0.988856 | 0.958027 | 0.913358 | 0.007332 | 0.003147 | 0.834222 | 0.998724 | 6 |
| 1 | 28734.0 | 638.018 | 200.524796 | 182.734419 | 1.097356 | 0.411785 | 29172.0 | 191.272750 | 0.783968 | 0.984986 | 0.887034 | 0.953861 | 0.006979 | 0.003564 | 0.909851 | 0.998430 | 6 |
| 2 | 29380.0 | 624.110 | 212.826130 | 175.931143 | 1.209713 | 0.562727 | 29690.0 | 193.410904 | 0.778113 | 0.989559 | 0.947849 | 0.908774 | 0.007244 | 0.003048 | 0.825871 | 0.999066 | 6 |
| 4 | 30140.0 | 620.134 | 201.847882 | 190.279279 | 1.060798 | 0.333680 | 30417.0 | 195.896503 | 0.773098 | 0.990893 | 0.984877 | 0.970516 | 0.006697 | 0.003665 | 0.941900 | 0.999166 | 6 |
| 5 | 30279.0 | 634.927 | 212.560556 | 181.510182 | 1.171067 | 0.520401 | 30600.0 | 196.347702 | 0.775688 | 0.989510 | 0.943852 | 0.923726 | 0.007020 | 0.003153 | 0.853270 | 0.999236 | 6 |
| 6 | 30477.0 | 670.033 | 211.050155 | 184.039050 | 1.146768 | 0.489478 | 30970.0 | 196.988633 | 0.762402 | 0.984081 | 0.853080 | 0.933374 | 0.006925 | 0.003242 | 0.871186 | 0.999049 | 6 |
| 8 | 30685.0 | 635.681 | 213.534145 | 183.157146 | 1.165852 | 0.514081 | 31044.0 | 197.659696 | 0.771561 | 0.988436 | 0.954240 | 0.925658 | 0.006959 | 0.003152 | 0.856844 | 0.998953 | 6 |
| 9 | 30834.0 | 631.934 | 217.227813 | 180.897469 | 1.200834 | 0.553642 | 31120.0 | 198.139012 | 0.783683 | 0.990810 | 0.970278 | 0.912125 | 0.007045 | 0.003008 | 0.831973 | 0.999061 | 6 |
set(df_root.Class)
{0, 1, 2, 3, 4, 5, 6}
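Because the mapping above is built from a `set`, the integer assigned to each class can change from one run to the next. As one possible alternative (an assumption of this sketch, not the method used above), the codes can be derived from the sorted class names so that they are reproducible, and an inverse dictionary can be kept to decode predictions later. The toy labels are illustrative.

```python
import pandas as pd

# Hypothetical toy labels standing in for the real "Class" column.
labels = pd.Series(["SEKER", "BOMBAY", "SIRA", "SEKER", "CALI"], name="Class")

# Sorting the unique names fixes the order, so every run yields the same codes.
class_to_code = {name: code for code, name in enumerate(sorted(labels.unique()))}

# Encode the labels, and keep the inverse mapping for decoding model output.
codes = labels.map(class_to_code)
code_to_class = {code: name for name, code in class_to_code.items()}

print(class_to_code)
print(codes.tolist())
print(codes.map(code_to_class).tolist())
```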
The distribution provides a parameterized mathematical function that can be used to calculate the probability of any individual observation from the sample space. This distribution describes the grouping or the density of the observations, called the probability density function. We can also calculate the likelihood of an observation having a value equal to or less than a given value. A summary of these relationships between observations is called a cumulative distribution function.
Almost all the columns approximately follow a normal distribution, so no transformation is needed to make them normally distributed.
# let's see how data is distributed for every column
# (sns.distplot is deprecated in recent seaborn releases; sns.histplot(..., kde=True) is the suggested replacement)
df_tmp = df_root.copy()
plt.figure(figsize=(20,25), facecolor='white')
plotnumber = 1
print(df_tmp.columns)
for column in df_tmp:
    if plotnumber<=18 :
        ax = plt.subplot(5,4,plotnumber)
        sns.distplot(df_tmp[column])
        plt.xlabel(column,fontsize=20)
    plotnumber+=1
plt.show()
Index(['Area', 'Perimeter', 'MajorAxisLength', 'MinorAxisLength',
'AspectRation', 'Eccentricity', 'ConvexArea', 'EquivDiameter', 'Extent',
'Solidity', 'roundness', 'Compactness', 'ShapeFactor1', 'ShapeFactor2',
'ShapeFactor3', 'ShapeFactor4', 'Class'],
dtype='object')
Most of the columns are normally distributed but some have skewness due to Outliers
print("skewness of each column")
for col in df_root:
    print(col,"->",df_root[col].skew())
skewness of each column
Area -> 2.951255591634598
Perimeter -> 1.6249749765014962
MajorAxisLength -> 1.3576682685172654
MinorAxisLength -> 2.2357427158379135
AspectRation -> 0.5840957384596052
Eccentricity -> -1.065682818645653
ConvexArea -> 2.9401487885739988
EquivDiameter -> 1.947587451025246
Extent -> -0.8937382041461224
Solidity -> -2.550156750801783
roundness -> -0.6356652140347356
Compactness -> 0.03596081604829935
ShapeFactor1 -> -0.5328984958439338
ShapeFactor2 -> 0.29964487288456076
ShapeFactor3 -> 0.24173496282829024
ShapeFactor4 -> -2.758829845489117
Class -> -0.1576551412046186
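For readers who want to see what the skewness numbers measure, here is a hedged sketch of the bias-adjusted sample skewness, which is the estimator `Series.skew()` reports in pandas. The helper name `adjusted_skewness` and the two sample lists are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def adjusted_skewness(values):
    """Bias-adjusted sample skewness: sqrt(n(n-1))/(n-2) * m3 / m2**1.5."""
    x = np.asarray(values, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)          # second central moment
    m3 = np.mean(d ** 3)          # third central moment
    g1 = m3 / m2 ** 1.5           # plain (biased) moment skewness
    return float(np.sqrt(n * (n - 1)) / (n - 2) * g1)

right_tailed = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 20.0]   # one large value stretches the right tail
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]                 # mirror-symmetric around 3

print(adjusted_skewness(right_tailed), pd.Series(right_tailed).skew())
print(adjusted_skewness(symmetric))
```

A positive value indicates a longer right tail, a negative value a longer left tail, and a value near zero a roughly symmetric distribution.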
Outliers are data points that differ significantly from the other observations in a given dataset. They can occur because of variability in measurement or because of mistakes made when recording data points.
df_outlier = df_root.drop(columns=["Class"])
df_outlier.head()
| | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | Solidity | roundness | Compactness | ShapeFactor1 | ShapeFactor2 | ShapeFactor3 | ShapeFactor4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28395.0 | 610.291 | 208.178117 | 173.888747 | 1.197191 | 0.549812 | 28715.0 | 190.141097 | 0.763923 | 0.988856 | 0.958027 | 0.913358 | 0.007332 | 0.003147 | 0.834222 | 0.998724 |
| 1 | 28734.0 | 638.018 | 200.524796 | 182.734419 | 1.097356 | 0.411785 | 29172.0 | 191.272750 | 0.783968 | 0.984986 | 0.887034 | 0.953861 | 0.006979 | 0.003564 | 0.909851 | 0.998430 |
| 2 | 29380.0 | 624.110 | 212.826130 | 175.931143 | 1.209713 | 0.562727 | 29690.0 | 193.410904 | 0.778113 | 0.989559 | 0.947849 | 0.908774 | 0.007244 | 0.003048 | 0.825871 | 0.999066 |
| 4 | 30140.0 | 620.134 | 201.847882 | 190.279279 | 1.060798 | 0.333680 | 30417.0 | 195.896503 | 0.773098 | 0.990893 | 0.984877 | 0.970516 | 0.006697 | 0.003665 | 0.941900 | 0.999166 |
| 5 | 30279.0 | 634.927 | 212.560556 | 181.510182 | 1.171067 | 0.520401 | 30600.0 | 196.347702 | 0.775688 | 0.989510 | 0.943852 | 0.923726 | 0.007020 | 0.003153 | 0.853270 | 0.999236 |
Checking for outliers
fig, ax = plt.subplots(figsize=(50,30))
sns.boxplot(data=df_outlier, width= 0.5,ax=ax, fliersize=3)
# Show BoxPlot for all features before outlier handling
Numeric_cols = df_outlier.columns
fig, ax = plt.subplots(4, 4, figsize=(15, 12))
for variable, subplot in zip(Numeric_cols, ax.flatten()):
    sns.boxplot(y= df_outlier[variable], ax=subplot)
plt.tight_layout()
There are many outliers in the data, especially in the "Area" and "ConvexArea" columns.
Option 1: Deleting the outliers would cause a data loss of about 3,000 rows, so to avoid that we can use another method.
Option 2: Replace each outlier with the boundary of the distribution; however, as the number of outliers is large, this would introduce skewness in the data.
Option 3: Replace the outliers with the median, as the median is not affected by outliers.
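The three options can be compared side by side on a small example. The sketch below flags outliers with the 1.5 x IQR rule, counts how many rows deletion would cost (Option 1), clips values to the boundaries (Option 2), and replaces them with the column median (Option 3). The helper names, the toy values, and the use of pandas' default linear quantile interpolation with strict inequalities are assumptions of this sketch.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Lower and upper fences of the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outlier_mask(df, k=1.5):
    """Boolean frame that is True wherever a value lies outside its column's fences."""
    flags = {}
    for col in df.columns:
        low, high = iqr_bounds(df[col], k)
        flags[col] = (df[col] < low) | (df[col] > high)
    return pd.DataFrame(flags)

# Hypothetical toy data with one extreme value per column.
toy = pd.DataFrame({
    "Area": [10.0, 11.0, 12.0, 13.0, 14.0, 100.0],
    "Extent": [0.70, 0.71, 0.72, 0.73, 0.10, 0.74],
})
mask = outlier_mask(toy)

# Option 1: rows that would be lost if every row with an outlier were deleted.
rows_lost_if_deleted = int(mask.any(axis=1).sum())

# Option 2: clip each column to its fences.
clipped = toy.copy()
for col in toy.columns:
    low, high = iqr_bounds(toy[col])
    clipped[col] = toy[col].clip(low, high)

# Option 3: replace flagged values with the column median.
median_filled = toy.copy()
for col in toy.columns:
    median_filled.loc[mask[col], col] = toy[col].median()

print("rows lost by deletion:", rows_lost_if_deleted)
print(clipped)
print(median_filled)
```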
print("skewness of each column after Outlier Handling")
for column in df_outlier.columns:
    # Quartiles with midpoint interpolation (newer NumPy versions name this argument `method`)
    Q1 = np.percentile(df_outlier[column], 25, interpolation = 'midpoint')
    Q3 = np.percentile(df_outlier[column], 75, interpolation = 'midpoint')
    IQR = Q3 - Q1
    median = df_outlier[column].quantile(0.50)
    # Replace values on or beyond the 1.5 * IQR fences with the column median
    df_outlier[column] = np.where(df_outlier[column] <=(Q1-1.5*IQR), median,df_outlier[column])
    df_outlier[column] = np.where(df_outlier[column] >=(Q3+1.5*IQR), median,df_outlier[column])
    print(column, "->", df_outlier[column].skew())
skewness of each column after Outlier Handling
Area -> 0.8307316808641193
Perimeter -> 0.597244434360571
MajorAxisLength -> 0.6381489693766639
MinorAxisLength -> 0.5637555378786383
AspectRation -> 0.3870834024559011
Eccentricity -> -0.476649885871401
ConvexArea -> 0.8362281840709223
EquivDiameter -> 0.5576262829534949
Extent -> -0.6504259333521553
Solidity -> -0.7714516398467093
roundness -> -0.42433582578985685
Compactness -> -0.016012129596301765
ShapeFactor1 -> 0.05792896804054203
ShapeFactor2 -> 0.29964487288456076
ShapeFactor3 -> 0.12236241153209534
ShapeFactor4 -> -1.0527209986865944
# Show BoxPlot for all features after outlier handling
Numeric_cols = df_outlier.columns
fig, ax = plt.subplots(4, 4, figsize=(15, 12))
for variable, subplot in zip(Numeric_cols, ax.flatten()):
    sns.boxplot(y= df_outlier[variable], ax=subplot)
plt.tight_layout()
After replacing the outliers with the median (which is robust to outliers), the data looks good: the remaining extreme points lie close to the whisker boundaries and can be kept for building the model.
The distributions are also much less skewed after outlier handling; for example, the skewness of Area drops from 2.95 to 0.83.
Feature scaling is one of the most important data preprocessing steps in machine learning. Algorithms that compute the distance between the features are biased towards numerically larger values if the data is not scaled.
Normalization or Min-Max Scaling: transforms features to be on a similar scale, typically the range [0, 1]. The new point is calculated as:
X_new = (X - X_min)/(X_max - X_min)
Standardization or Z-Score Normalization: transforms features by subtracting the mean and dividing by the standard deviation. The result is often called the Z-score.
X_new = (X - mean)/Std
Standardization can be helpful in cases where the data follows a Gaussian distribution.
The data is approximately normally distributed, so applying standard scaling is enough.
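Both scaling formulas can be checked by hand on a short array. This is a minimal sketch with made-up values; it uses the population standard deviation (NumPy's default, `ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature values.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max scaling: smallest value maps to 0, largest to 1.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: result has mean 0 and standard deviation 1.
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)
print(x_zscore)
print(x_zscore.mean(), x_zscore.std())
```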
Standard Scaling
trans = StandardScaler()
data = trans.fit_transform(df_outlier)
df_rootSS = pd.DataFrame(data,columns=df_outlier.columns)
plt.figure(figsize=(20,25), facecolor='white')
plotnumber = 1
for column in df_rootSS.columns:
    if plotnumber<=18 :
        ax = plt.subplot(5,4,plotnumber)
        sns.distplot(df_rootSS[column])
        plt.xlabel(column,fontsize=20)
    plotnumber+=1
plt.show()
Robust Scaling
scalar = RobustScaler()
# Keep the original column names so the subplot labels stay readable
df_rootRB = pd.DataFrame(scalar.fit_transform(df_outlier), columns=df_outlier.columns)
plt.figure(figsize=(20,25), facecolor='white')
plotnumber = 1
for column in df_rootRB.columns:
    if plotnumber<=18 :
        ax = plt.subplot(5,4,plotnumber)
        sns.distplot(df_rootRB[column])
        plt.xlabel(column,fontsize=20)
    plotnumber+=1
plt.show()
dict_skew = {
    "Column" : df_rootSS.skew().index,
    "Standard Scaling" : df_rootSS.skew().values,
    "Robust Scaling" : df_rootRB.skew().values
}
skewness_df = pd.DataFrame(dict_skew)
skewness_df
| | Column | Standard Scaling | Robust Scaling |
|---|---|---|---|
| 0 | Area | 0.830732 | 0.830732 |
| 1 | Perimeter | 0.597244 | 0.597244 |
| 2 | MajorAxisLength | 0.638149 | 0.638149 |
| 3 | MinorAxisLength | 0.563756 | 0.563756 |
| 4 | AspectRation | 0.387083 | 0.387083 |
| 5 | Eccentricity | -0.476650 | -0.476650 |
| 6 | ConvexArea | 0.836228 | 0.836228 |
| 7 | EquivDiameter | 0.557626 | 0.557626 |
| 8 | Extent | -0.650426 | -0.650426 |
| 9 | Solidity | -0.771452 | -0.771452 |
| 10 | roundness | -0.424336 | -0.424336 |
| 11 | Compactness | -0.016012 | -0.016012 |
| 12 | ShapeFactor1 | 0.057929 | 0.057929 |
| 13 | ShapeFactor2 | 0.299645 | 0.299645 |
| 14 | ShapeFactor3 | 0.122362 | 0.122362 |
| 15 | ShapeFactor4 | -1.052721 | -1.052721 |
From the table above we can see that both transformations give the same skewness. This is expected, because skewness does not change under a linear rescaling of the data. The data is approximately normally distributed, so no other transformations are required.
Since both scaling techniques give the same result, either can be used; here we went ahead with the Standard Scaler.
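The identical skewness columns are not a coincidence: shifting a variable and dividing it by a positive constant leaves its skewness unchanged. The sketch below illustrates this on synthetic, deliberately skewed data; the manual formulas are stand-ins (an assumption of this example) for what the standard and robust scalers compute with their default settings.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (illustrative, not the bean data).
rng = np.random.default_rng(42)
raw = pd.Series(rng.exponential(scale=3.0, size=1000))

# Standard scaling: subtract the mean, divide by the population standard deviation.
standardized = (raw - raw.mean()) / raw.std(ddof=0)

# Robust scaling: subtract the median, divide by the interquartile range.
q1, q3 = raw.quantile(0.25), raw.quantile(0.75)
robust = (raw - raw.median()) / (q3 - q1)

print(raw.skew(), standardized.skew(), robust.skew())
```

Scaling therefore changes the location and spread of a feature but cannot reduce its skewness; that would need a nonlinear transformation.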
df_prime = df_rootSS.copy()
df_prime["Class"] = list(df_root["Class"])
Feature selection is a way of selecting the subset of the most relevant features from the original features set by removing the redundant, irrelevant, or noisy features.
Variance Inflation Factor: a variance inflation factor (VIF) detects multicollinearity in regression analysis. Multicollinearity is when there is correlation between predictors (i.e. independent variables) in a model; its presence can adversely affect the regression results. The VIF estimates how much the variance of a regression coefficient is inflated due to multicollinearity in the model.
We need to remove columns with a high VIF score so that the score of every remaining feature is around 5 or lower.
To do so, we remove the feature with the highest VIF, recalculate the scores for the remaining columns, and repeat until every column's score is around 5.
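The elimination procedure described above can be sketched from first principles. The example below computes VIF_j = 1 / (1 - R_j^2) with NumPy least squares, where R_j^2 comes from regressing feature j on all other features, and then drops the highest-VIF feature until every score is at or below a threshold. The helper names `vif_table` and `drop_high_vif`, the explicit intercept, the threshold of 5.0, and the synthetic data are all assumptions of this sketch; it is not claimed to match the statsmodels routine call for call.

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """VIF per column: 1 / (1 - R^2) from regressing the column on all others."""
    X = df.to_numpy(dtype=float)
    scores = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Design matrix with an explicit intercept column.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        scores.append(1.0 / (1.0 - r2))
    return pd.Series(scores, index=df.columns, name="VIF")

def drop_high_vif(df, threshold=5.0):
    """Repeatedly drop the column with the largest VIF until all are <= threshold."""
    kept = df.copy()
    while kept.shape[1] > 1:
        vif = vif_table(kept)
        if vif.max() <= threshold:
            break
        kept = kept.drop(columns=[vif.idxmax()])
    return kept

# Synthetic data: b is almost a copy of a, c is independent of both.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=300),
    "c": rng.normal(size=300),
})

print(vif_table(toy))
kept = drop_high_vif(toy, threshold=5.0)
print(vif_table(kept))
```

Recomputing the scores after every removal matters, because dropping one feature of a collinear group lowers the VIF of the others.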
def vif_score(x):
    # Standardize the features, then compute the VIF of every column
    scalar = StandardScaler()
    arr = scalar.fit_transform(x)
    return pd.DataFrame([[x.columns[i], variance_inflation_factor(arr,i)] for i in range(arr.shape[1])], columns = ["Features","VIF_Score"])
vif_score(df_prime)
| | Features | VIF_Score |
|---|---|---|
| 0 | Area | 987.574141 |
| 1 | Perimeter | 37.668257 |
| 2 | MajorAxisLength | 10.226559 |
| 3 | MinorAxisLength | 28.718741 |
| 4 | AspectRation | 5.676173 |
| 5 | Eccentricity | 4.469610 |
| 6 | ConvexArea | 1019.261547 |
| 7 | EquivDiameter | 106.526744 |
| 8 | Extent | 1.120211 |
| 9 | Solidity | 1.895040 |
| 10 | roundness | 6.262355 |
| 11 | Compactness | 29.402958 |
| 12 | ShapeFactor1 | 23.336771 |
| 13 | ShapeFactor2 | 10.740645 |
| 14 | ShapeFactor3 | 23.213154 |
| 15 | ShapeFactor4 | 1.660883 |
| 16 | Class | 1.514598 |
df_tmp = df_prime.copy()
df_tmp.drop(columns=['Class'],inplace=True)
vif_score(df_tmp)
| | Features | VIF_Score |
|---|---|---|
| 0 | Area | 987.129191 |
| 1 | Perimeter | 37.507568 |
| 2 | MajorAxisLength | 10.226324 |
| 3 | MinorAxisLength | 28.634145 |
| 4 | AspectRation | 5.676114 |
| 5 | Eccentricity | 4.415332 |
| 6 | ConvexArea | 1014.440900 |
| 7 | EquivDiameter | 105.643845 |
| 8 | Extent | 1.117034 |
| 9 | Solidity | 1.863884 |
| 10 | roundness | 6.125728 |
| 11 | Compactness | 29.099187 |
| 12 | ShapeFactor1 | 20.434658 |
| 13 | ShapeFactor2 | 10.278319 |
| 14 | ShapeFactor3 | 23.203049 |
| 15 | ShapeFactor4 | 1.660862 |
df_tmp.drop(columns=['ConvexArea'],inplace=True)
vif_score(df_tmp)
| | Features | VIF_Score |
|---|---|---|
| 0 | Area | 44.239569 |
| 1 | Perimeter | 37.261501 |
| 2 | MajorAxisLength | 10.226292 |
| 3 | MinorAxisLength | 28.441701 |
| 4 | AspectRation | 5.673256 |
| 5 | Eccentricity | 4.413360 |
| 6 | EquivDiameter | 105.241687 |
| 7 | Extent | 1.116506 |
| 8 | Solidity | 1.830235 |
| 9 | roundness | 6.110601 |
| 10 | Compactness | 29.095556 |
| 11 | ShapeFactor1 | 20.242795 |
| 12 | ShapeFactor2 | 10.255677 |
| 13 | ShapeFactor3 | 23.202849 |
| 14 | ShapeFactor4 | 1.660845 |
df_tmp.drop(columns=['EquivDiameter'],inplace=True)
vif_score(df_tmp)
| | Features | VIF_Score |
|---|---|---|
| 0 | Area | 26.377697 |
| 1 | Perimeter | 30.865171 |
| 2 | MajorAxisLength | 10.133648 |
| 3 | MinorAxisLength | 25.818747 |
| 4 | AspectRation | 5.632993 |
| 5 | Eccentricity | 4.412026 |
| 6 | Extent | 1.116506 |
| 7 | Solidity | 1.821931 |
| 8 | roundness | 6.071027 |
| 9 | Compactness | 28.416524 |
| 10 | ShapeFactor1 | 18.196214 |
| 11 | ShapeFactor2 | 10.164802 |
| 12 | ShapeFactor3 | 23.179403 |
| 13 | ShapeFactor4 | 1.660495 |
df_tmp.drop(columns=['Perimeter'],inplace=True)
vif_score(df_tmp)
| | Features | VIF_Score |
|---|---|---|
| 0 | Area | 20.837077 |
| 1 | MajorAxisLength | 8.639152 |
| 2 | MinorAxisLength | 25.759283 |
| 3 | AspectRation | 5.632554 |
| 4 | Eccentricity | 4.410109 |
| 5 | Extent | 1.114657 |
| 6 | Solidity | 1.821883 |
| 7 | roundness | 4.941783 |
| 8 | Compactness | 28.276408 |
| 9 | ShapeFactor1 | 16.614149 |
| 10 | ShapeFactor2 | 10.123211 |
| 11 | ShapeFactor3 | 23.176231 |
| 12 | ShapeFactor4 | 1.659389 |
df_tmp.drop(columns=['Compactness'],inplace=True)
vif_score(df_tmp)
| | Features | VIF_Score |
|---|---|---|
| 0 | Area | 20.170413 |
| 1 | MajorAxisLength | 8.635127 |
| 2 | MinorAxisLength | 25.306934 |
| 3 | AspectRation | 5.353142 |
| 4 | Eccentricity | 4.392590 |
| 5 | Extent | 1.113670 |
| 6 | Solidity | 1.812668 |
| 7 | roundness | 4.883257 |
| 8 | ShapeFactor1 | 16.410663 |
| 9 | ShapeFactor2 | 9.602287 |
| 10 | ShapeFactor3 | 11.323253 |
| 11 | ShapeFactor4 | 1.658743 |
df_tmp.drop(columns=['MinorAxisLength'],inplace=True)
vif_score(df_tmp)
| | Features | VIF_Score |
|---|---|---|
| 0 | Area | 12.622901 |
| 1 | MajorAxisLength | 8.574261 |
| 2 | AspectRation | 5.305245 |
| 3 | Eccentricity | 4.392573 |
| 4 | Extent | 1.113173 |
| 5 | Solidity | 1.802818 |
| 6 | roundness | 4.869437 |
| 7 | ShapeFactor1 | 8.727018 |
| 8 | ShapeFactor2 | 9.581197 |
| 9 | ShapeFactor3 | 11.006606 |
| 10 | ShapeFactor4 | 1.656902 |
df_tmp.drop(columns=['Area'],inplace=True)
vif_score(df_tmp)
| | Features | VIF_Score |
|---|---|---|
| 0 | MajorAxisLength | 7.322690 |
| 1 | AspectRation | 5.170812 |
| 2 | Eccentricity | 4.387638 |
| 3 | Extent | 1.112438 |
| 4 | Solidity | 1.791607 |
| 5 | roundness | 4.790035 |
| 6 | ShapeFactor1 | 3.378236 |
| 7 | ShapeFactor2 | 9.444791 |
| 8 | ShapeFactor3 | 10.625017 |
| 9 | ShapeFactor4 | 1.655286 |
df_tmp.drop(columns=['ShapeFactor3'],inplace=True)
vif_score(df_tmp)
| | Features | VIF_Score |
|---|---|---|
| 0 | MajorAxisLength | 7.183420 |
| 1 | AspectRation | 4.700802 |
| 2 | Eccentricity | 3.190545 |
| 3 | Extent | 1.111899 |
| 4 | Solidity | 1.781582 |
| 5 | roundness | 4.555491 |
| 6 | ShapeFactor1 | 2.768339 |
| 7 | ShapeFactor2 | 8.582353 |
| 8 | ShapeFactor4 | 1.652437 |
df_tmp.drop(columns=['ShapeFactor2'],inplace=True)
vif_score(df_tmp)
| | Features | VIF_Score |
|---|---|---|
| 0 | MajorAxisLength | 5.478921 |
| 1 | AspectRation | 3.795605 |
| 2 | Eccentricity | 3.179796 |
| 3 | Extent | 1.108359 |
| 4 | Solidity | 1.744490 |
| 5 | roundness | 4.142303 |
| 6 | ShapeFactor1 | 2.766302 |
| 7 | ShapeFactor4 | 1.574653 |
vif_var = vif_score(df_tmp)
colms = (vif_var["Features"].values)
df_prime = df_prime[np.append(colms,"Class")]
df_prime.head()
| | MajorAxisLength | AspectRation | Eccentricity | Extent | Solidity | roundness | ShapeFactor1 | ShapeFactor4 | Class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.494114 | -1.673551 | -0.025058 | 0.249564 | 0.300339 | 1.467362 | 0.697402 | 1.072735 | 6 |
| 1 | -1.604958 | -2.132047 | -0.025058 | 0.703317 | -1.018448 | 0.217883 | 0.320377 | 0.962874 | 6 |
| 2 | -1.426796 | -1.616047 | -0.025058 | 0.570786 | 0.539801 | 1.288237 | 0.603806 | 1.200817 | 6 |
| 3 | -1.585795 | -2.299943 | -0.025058 | 0.457261 | 0.994507 | 1.939919 | 0.019427 | 1.238212 | 6 |
| 4 | -1.430643 | -1.793529 | -0.025058 | 0.515899 | 0.523115 | 1.217878 | 0.364620 | 1.264304 | 6 |
Multicollinearity removed
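The repeated "drop the feature with the largest VIF, then recompute" steps above can be sketched as a loop. This is a minimal illustration, not the notebook's own `vif_score` helper: `vif_from_corr`, `drop_high_vif` and the `threshold` value are hypothetical names and choices, and the VIF is computed as the diagonal of the inverse correlation matrix, which agrees with the regression-based definition when an intercept is included.

```python
import numpy as np
import pandas as pd

def vif_from_corr(df):
    # VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
    # of the inverse of the feature correlation matrix.
    corr = np.corrcoef(df.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns)

def drop_high_vif(df, threshold=6.0):
    # threshold is an assumed cutoff, not a value taken from the notebook
    df = df.copy()
    dropped = []
    while df.shape[1] > 1:
        vif = vif_from_corr(df)
        worst = vif.idxmax()
        if vif[worst] <= threshold:
            break
        dropped.append(worst)
        df = df.drop(columns=[worst])
    return df, dropped

# Synthetic demo: the third column is almost a linear combination of the first two.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
demo = pd.DataFrame({"a": a, "b": b,
                     "a_plus_b": a + b + 0.01 * rng.normal(size=500)})
reduced, dropped = drop_high_vif(demo)
print(dropped, list(reduced.columns))
```

Applied to the bean features, a loop like this would stop once every remaining VIF is at or below the chosen cutoff.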
print(df_prime.corr())
sns.heatmap(df_prime.corr(), cmap="YlGnBu", annot=True)
MajorAxisLength AspectRation Eccentricity Extent \
MajorAxisLength 1.000000 0.623954 0.594229 -0.113960
AspectRation 0.623954 1.000000 0.790845 -0.253065
Eccentricity 0.594229 0.790845 1.000000 -0.218750
Extent -0.113960 -0.253065 -0.218750 1.000000
Solidity -0.291867 -0.292597 -0.254339 0.178352
roundness -0.718971 -0.716648 -0.695146 0.254880
ShapeFactor1 -0.628504 -0.026844 -0.031794 -0.074377
ShapeFactor4 -0.484125 -0.486231 -0.435788 0.096686
Class 0.083533 -0.124493 -0.027616 0.009776
Solidity roundness ShapeFactor1 ShapeFactor4 Class
MajorAxisLength -0.291867 -0.718971 -0.628504 -0.484125 0.083533
AspectRation -0.292597 -0.716648 -0.026844 -0.486231 -0.124493
Eccentricity -0.254339 -0.695146 -0.031794 -0.435788 -0.027616
Extent 0.178352 0.254880 -0.074377 0.096686 0.009776
Solidity 1.000000 0.557045 0.135051 0.415943 0.120401
roundness 0.557045 1.000000 0.263692 0.471580 0.013536
ShapeFactor1 0.135051 0.263692 1.000000 0.193376 -0.370859
ShapeFactor4 0.415943 0.471580 0.193376 1.000000 0.072610
Class 0.120401 0.013536 -0.370859 0.072610 1.000000
<AxesSubplot:>
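There is no high correlation between the features and the target variable. Since Class is a label-encoded nominal variable, this correlation is only a rough indication.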
Principal Component Analysis is an unsupervised learning algorithm that is used for dimensionality reduction in machine learning. It is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the principal components.
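The decorrelation property described above can be checked on a small synthetic example. This is only a sketch under assumed data: `X_corr` is made up for illustration and is not the bean dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Two columns share a common signal (strongly correlated), the third is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X_corr = np.hstack([
    base + 0.3 * rng.normal(size=(500, 1)),
    base + 0.3 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

scores = PCA().fit_transform(X_corr)

corr_before = np.corrcoef(X_corr, rowvar=False)
corr_after = np.corrcoef(scores, rowvar=False)

# Largest absolute off-diagonal correlation among the principal components.
max_offdiag_after = np.abs(corr_after - np.eye(3)).max()
print(corr_before[0, 1], max_offdiag_after)
```

The first two input columns are highly correlated, while the transformed components are uncorrelated up to floating-point error.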
pca = PCA()
principalComponents = pca.fit_transform(df_prime.drop(columns=["Class"]))
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative explained variance (ratio)')  # running total, not per component
plt.title('Explained Variance')
plt.show()
From the plot above, 4 principal components explain almost 90% of the variance in the data and 6 principal components explain around 95%.
So, instead of feeding all 8 columns to the machine learning algorithm, we keep 5 principal components, a number that sits between those two thresholds, and expect a similar result.
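Reading the component count off the plot can also be done programmatically. The sketch below uses synthetic data (`X_demo` and `components_for` are illustrative names, not part of the notebook). Note that `plt.plot` on a cumulative sum uses a 0-based x-axis, so position k on such a plot corresponds to k + 1 components.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 8 observed columns driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 8))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(300, 8))

pca_full = PCA().fit(X_demo)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

def components_for(threshold):
    # Smallest number of components whose cumulative ratio reaches the threshold.
    return int(np.argmax(cum_var >= threshold) + 1)

n_90 = components_for(0.90)
n_95 = components_for(0.95)
print(n_90, n_95)
```

The same two calls on the fitted `pca` object from the cell above would give the exact counts for the 90% and 95% levels.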
pca = PCA(n_components=5)
new_data = pca.fit_transform(df_prime.drop(columns=["Class"]))
# This will be the new data fed to the algorithm.
principal_Df = pd.DataFrame(data = new_data
, columns = ['principal component 1', 'principal component 2','principal component 3','principal component 4','principal component 5'])
principal_Df['Class'] = list(df_prime["Class"])
principal_Df
| | principal component 1 | principal component 2 | principal component 3 | principal component 4 | principal component 5 | Class |
|---|---|---|---|---|---|---|
| 0 | -2.655565 | -0.497196 | 0.148967 | 0.136680 | -0.270488 | 6 |
| 1 | -1.916533 | -0.017437 | 1.229663 | 0.852133 | -1.178407 | 6 |
| 2 | -2.663991 | -0.277538 | -0.180201 | 0.198720 | -0.432386 | 6 |
| 3 | -3.343154 | 0.255319 | -0.204249 | -0.251362 | -0.000988 | 6 |
| 4 | -2.675149 | -0.087268 | -0.069893 | 0.050199 | -0.487487 | 6 |
| ... | ... | ... | ... | ... | ... | ... |
| 13586 | -0.958646 | -0.624323 | -0.386779 | -1.174194 | -0.005368 | 0 |
| 13587 | -1.653223 | 0.543385 | -0.924785 | 0.178687 | -0.296305 | 0 |
| 13588 | -1.129661 | -0.140446 | 0.005781 | -0.691203 | 0.427898 | 0 |
| 13589 | -0.482116 | -0.461576 | 0.769115 | -0.613146 | 0.705906 | 0 |
| 13590 | -0.621267 | -0.120633 | -1.079679 | 0.248793 | -0.677759 | 0 |
13591 rows × 6 columns
The number of columns is reduced from 18 to 8 using VIF scores, and from 8 to 5 using Principal Component Analysis (mitigating the curse of dimensionality).
X = principal_Df.drop(columns = ["Class"])
y = principal_Df.Class
The train-test split procedure estimates how a machine learning algorithm performs when it makes predictions on data that was not used to train the model.
The data is split into training and test sets: 80% is used to fit the model and the remaining 20% is held out for testing/validation.
x_train_c1,x_test_c1,y_train_c1,y_test_c1 = train_test_split(X,y, test_size= 0.20, random_state = 355)
x_train_c1.shape
(10872, 5)
x_test_c1.shape
(2719, 5)
print(y_train_c1.shape)
{ k:list(y_train_c1).count(k) for k in set(y_train_c1)}
(10872,)
{0: 2846, 1: 435, 2: 1291, 3: 1068, 4: 1555, 5: 2096, 6: 1581}
print(y_test_c1.shape)
{ k:list(y_test_c1).count(k) for k in set(y_test_c1)}
(2719,)
{0: 700, 1: 87, 2: 339, 3: 254, 4: 373, 5: 540, 6: 426}
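Check whether, after the split, any class dominates the training or the test set disproportionately.
The data is well split: the class proportions in the training and test sets are similar, although the classes themselves differ in size (class 1 has far fewer samples than class 0).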
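If the class proportions should match between the two sets by construction rather than by chance, `train_test_split` accepts a `stratify` argument. A minimal sketch on made-up labels (`X_demo` and `y_demo` are assumptions, not the bean data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced synthetic labels: 70% / 10% / 20%.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], [700, 100, 200])
X_demo = rng.normal(size=(len(y_demo), 5))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=355
)

# Share of each class in the training and test sets.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share, test_share)
```

With stratification, the smallest class keeps the same share in both sets.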
The multi_class parameter has to be chosen: multinomial or OvR (one-versus-rest).
lr_model_c1 = LogisticRegression(multi_class='multinomial')
lr_model_c1.fit(x_train_c1,y_train_c1)
LogisticRegression(multi_class='multinomial')
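For comparison, the two strategies can be set up side by side. This is a sketch on synthetic data, with the assumption that the installed scikit-learn fits a multinomial model by default for multi-class targets with the lbfgs solver (as far as we know, recent releases do, and they deprecate the `multi_class` argument). The one-versus-rest variant is built with the `OneVsRestClassifier` wrapper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Small synthetic 3-class problem (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)

# Multinomial (softmax): one joint model over all classes.
softmax_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# One-versus-rest: one binary logistic model per class.
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)

softmax_proba = softmax_model.predict_proba(X_demo)
print(softmax_model.score(X_demo, y_demo), ovr_model.score(X_demo, y_demo))
```

The multinomial model normalizes probabilities across classes jointly, while OvR trains each class against all others independently.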
Compare the LR model with a non-linear model
rf_classifier_c1 = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
rf_classifier_c1.fit(x_train_c1,y_train_c1)
RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=42)
Cross-validation is a data partitioning strategy that lets you use your dataset effectively to build a more generalized model. The main intention of any kind of machine learning is to develop a model that performs well on unseen data. One can build a perfect model on the training data with 100% accuracy or 0 error, but it may fail to generalize to unseen data. Such a model is not a good one: it overfits the training data. Machine learning is all about generalization, meaning that a model's performance can only be measured with data points that were never used during training. That is why we often split our data into a training set and a test set.
kfold_validation=KFold(40)
results_kfold_lr_c1 = cross_val_score(lr_model_c1,X,y,cv=kfold_validation)
results_kfold_rf_c1 = cross_val_score(rf_classifier_c1,X,y,cv=kfold_validation)
print("K-Fold validation score, Logistic Regression (linear model), CASE1 (80:20):", np.mean(results_kfold_lr_c1))
print("K-Fold validation score, Random Forest (non-linear model), CASE1 (80:20):", np.mean(results_kfold_rf_c1))
K-Fold validation score, Logistic Regression (linear model), CASE1 (80:20): 0.8070338799236509
K-Fold validation score, Random Forest (non-linear model), CASE1 (80:20): 0.7896800711435017
After K-fold validation, the models appear well built and generalize reasonably.
Compared with the non-linear model, the linear model performs slightly better (about 0.807 versus 0.790 mean accuracy).
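One caveat worth checking: `KFold` without shuffling cuts the rows into consecutive blocks. Judging by the head and tail of `principal_Df`, the rows appear to be ordered by class, in which case many test folds would contain only one or two classes. The sketch below shows the effect on synthetic sorted labels and how `StratifiedKFold` with shuffling avoids it; `y_sorted` and `classes_per_test_fold` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Labels sorted by class, 100 samples per class.
y_sorted = np.repeat([0, 1, 2, 3], 100)
X_dummy = np.zeros((len(y_sorted), 1))

def classes_per_test_fold(splitter):
    # Number of distinct classes that appear in each test fold.
    return [len(np.unique(y_sorted[test_idx]))
            for _, test_idx in splitter.split(X_dummy, y_sorted)]

plain = classes_per_test_fold(KFold(n_splits=4))
strat = classes_per_test_fold(
    StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
)
print(plain)
print(strat)
```

Passing a stratified, shuffled splitter as `cv` to `cross_val_score` would make each fold representative of all classes.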
x_train_c2,x_test_c2,y_train_c2,y_test_c2 = train_test_split(X,y, test_size= 0.10, random_state = 355)
x_train_c2.shape
(12231, 5)
x_test_c2.shape
(1360, 5)
print(y_train_c2.shape)
{ k:list(y_train_c2).count(k) for k in set(y_train_c2)}
(12231,)
{0: 3200, 1: 475, 2: 1462, 3: 1182, 4: 1750, 5: 2374, 6: 1788}
print(y_test_c2.shape)
{ k:list(y_test_c2).count(k) for k in set(y_test_c2)}
(1360,)
{0: 346, 1: 47, 2: 168, 3: 140, 4: 178, 5: 262, 6: 219}
The multi_class parameter has to be chosen: multinomial or OvR (one-versus-rest).
lr_model_c2 = LogisticRegression(multi_class='multinomial')
lr_model_c2.fit(x_train_c2,y_train_c2)
LogisticRegression(multi_class='multinomial')
Compare the LR model with a non-linear model
rf_classifier_c2 = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
rf_classifier_c2.fit(x_train_c2,y_train_c2)
RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=42)
kfold_validation=KFold(40)
results_kfold_lr_c2 = cross_val_score(lr_model_c2,X,y,cv=kfold_validation)
results_kfold_rf_c2 = cross_val_score(rf_classifier_c2,X,y,cv=kfold_validation)
print("K-Fold validation score, Logistic Regression (linear model), CASE2 (90:10):", np.mean(results_kfold_lr_c2))
print("K-Fold validation score, Random Forest (non-linear model), CASE2 (90:10):", np.mean(results_kfold_rf_c2))
K-Fold validation score, Logistic Regression (linear model), CASE2 (90:10): 0.8070338799236509
K-Fold validation score, Random Forest (non-linear model), CASE2 (90:10): 0.7896800711435017
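After K-fold validation, the model appears well built and generalizes reasonably, with no clear sign of overfitting or underfitting. Note that cross_val_score refits a fresh copy of each estimator on the folds of the full X and y, so these scores are identical to Case 1 and do not depend on the 90:10 split.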
Let's consider linear regression. Imagine that each single prediction ŷ produces a "conditional" distribution pmodel(ŷ∣x), given a sufficiently large training set. The goal of the learning algorithm is again to match the distribution pdata(y∣x). Now we need an assumption. We hypothesize the neural network or any estimator f as ŷ=f(x,θ). The estimator approximates the mean of the normal distribution N(μ,σ²) that we choose to parametrize pdata. Specifically, in the simplest case of linear regression we have μ=θᵀx. We also assume a fixed standard deviation σ of the normal distribution. These assumptions immediately cause MLE to become Mean Squared Error (MSE) optimization. Let's see how.
\begin{aligned} \hat{y} &= f(\textbf{x} , \boldsymbol{\theta}) \\ y &\sim \mathcal{N}\left(y ; \mu=\hat{y}, \sigma^{2}\right) \\ p(y \mid \textbf{x} , \boldsymbol{\theta}) &= \frac{1}{\sigma \sqrt{2 \pi}} \exp \left(\frac{-(y-\hat{y})^{2}}{2 \sigma^{2}}\right) \end{aligned}
In terms of log-likelihood we can form a loss function:
\begin{aligned} L &= \sum_{i=1}^{m} \log p\left(y^{(i)} \mid \textbf{x}^{(i)} , \boldsymbol{\theta}\right) \\ &= \sum_{i=1}^{m} \log \left[\frac{1}{\sigma \sqrt{2 \pi}} \exp \left(\frac{-\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}}{2 \sigma^{2}}\right)\right] \\ &= \sum_{i=1}^{m} \left[-\log (\sigma \sqrt{2 \pi}) + \log \exp \left(\frac{-\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}}{2 \sigma^{2}}\right)\right] \\ &= \sum_{i=1}^{m} \left[-\log (\sigma)-\frac{1}{2} \log (2 \pi)-\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}}{2 \sigma^{2}}\right] \\ &= -m \log (\sigma)-\frac{m}{2} \log (2 \pi)-\sum_{i=1}^{m} \frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}}{2 \sigma^{2}} \end{aligned}
Writing the mean squared error as MSE = \frac{1}{m} \sum_{i=1}^{m} \left\|\hat{y}^{(i)}-y^{(i)}\right\|^{2}, the log-likelihood becomes
\begin{aligned} L &= -m \log (\sigma)-\frac{m}{2} \log (2 \pi)-\frac{m}{2 \sigma^{2}} MSE \end{aligned}
By taking the gradient with respect to the parameters, only the MSE term remains, because the first two terms do not depend on θ:
\begin{aligned} \nabla_{\theta} L &= -\nabla_{\theta} \sum_{i=1}^{m} \frac{\left\|\hat{y}^{(i)}-y^{(i)}\right\|^{2}}{2 \sigma^{2}} \\ &= -\frac{m}{2 \sigma^{2}} \nabla_{\theta} MSE \end{aligned}
So maximizing the log-likelihood is equivalent to minimizing the MSE.
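The link between the Gaussian log-likelihood and the MSE can be checked numerically. The sketch below assumes a one-parameter linear model ŷ = θx on synthetic data with a fixed σ; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
x = rng.normal(size=m)
y = 2.0 * x + rng.normal(scale=0.5, size=m)
sigma = 0.5  # fixed standard deviation, as assumed in the derivation

def log_likelihood(theta):
    # Sum of log N(y; theta * x, sigma^2) over all samples.
    y_hat = theta * x
    return np.sum(-np.log(sigma * np.sqrt(2 * np.pi))
                  - (y - y_hat) ** 2 / (2 * sigma ** 2))

def mse(theta):
    return np.mean((y - theta * x) ** 2)

thetas = np.linspace(0.0, 4.0, 401)
ll = np.array([log_likelihood(t) for t in thetas])
errs = np.array([mse(t) for t in thetas])

best_ll = thetas[np.argmax(ll)]
best_mse = thetas[np.argmin(errs)]

# Closed form: L = -m log(sigma) - (m/2) log(2 pi) - m / (2 sigma^2) * MSE
closed = -m * np.log(sigma) - m / 2 * np.log(2 * np.pi) - m / (2 * sigma ** 2) * errs
print(best_ll, best_mse)
```

The θ that maximizes the log-likelihood on the grid is the same one that minimizes the MSE, and the closed-form expression reproduces the summed log-likelihood.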
In linear regression, we modeled pmodel(y∣x,θ) as a normal distribution. More precisely, we parametrized the mean to be μ=θᵀx. It is possible to convert linear regression to a classification problem. All we need to do is encode the ground truth as a one-hot vector:
\begin{aligned} p_{data}\left(y \mid \textbf{x}_{i}\right) = \begin{cases}1 & \text { if } y=y_{i} \\ 0 & \text { otherwise }\end{cases} \end{aligned}
where i refers to a single data instance. The cross-entropy for that instance is then
\begin{aligned} H_{i}\left(p_{data}, p_{model}\right) &=-\sum_{y \in Y} p_{data}\left(y \mid \textbf{x}_{i}\right) \log p_{model}\left(y \mid \textbf{x}_{i}\right) \\ &=-\log p_{model}\left(y_{i} \mid \textbf{x}_{i}\right) \end{aligned}
For simplicity let's consider the binary case of two labels, 0 and 1. Summing over all n instances gives the loss
\begin{aligned} L &=\sum_{i=1}^{n} H_{i}\left(p_{data}, p_{model}\right) \\ &=\sum_{i=1}^{n}-\log p_{model}\left(y_{i} \mid \textbf{x}_{i}\right) \\ &=-\sum_{i=1}^{n} \log p_{model}\left(y_{i} \mid \textbf{x}_{i}\right) \end{aligned}
and the parameters are found by minimizing it:
\begin{aligned} \boldsymbol{\theta}^{*} &= \underset{\boldsymbol{\theta}}{\arg \min } L = \underset{\boldsymbol{\theta}}{\arg \min } -\sum_{i=1}^{n} \log p_{model}\left(y_{i} \mid \textbf{x}_{i}\right) \end{aligned}
This is in line with our definition of conditional MLE:
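\begin{aligned} \boldsymbol{\theta}_{ML} = \underset{\boldsymbol{\theta}}{\arg \max } \sum_{i=1}^{m} \log p_{model}\left(y^{(i)} \mid \textbf{x}^{(i)}, \boldsymbol{\theta}\right) \end{aligned}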
Broadly speaking, MLE can be applied to most (supervised) learning problems by specifying a parametric family of (conditional) probability distributions. Another way to achieve this in a binary classification problem is to take the scalar output of the linear layer and pass it through a sigmoid function. The output will be in the range [0,1], and we define it as the probability p(y=1∣x,θ).
\begin{aligned} p(y=1 \mid \textbf{x}, \boldsymbol{\theta}) = \sigma\left(\boldsymbol{\theta}^{T} \textbf{x}\right) = \operatorname{sigmoid}\left(\boldsymbol{\theta}^{T} \textbf{x}\right) \in[0,1] \end{aligned}
Consequently, p(y=0∣x,θ)=1−p(y=1∣x,θ). In this case binary cross-entropy is used in practice. No closed-form solution exists here, so one approximates it with gradient descent. For reference, this approach is, perhaps surprisingly, known as "logistic regression".
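A minimal sketch of this idea follows: binary logistic regression fitted by plain gradient descent on the binary cross-entropy. The data, the learning rate `lr` and the iteration count are assumptions chosen for illustration, not values used elsewhere in this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(theta, X, y):
    # Mean binary cross-entropy; probabilities are clipped to avoid log(0).
    p = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic data drawn from a known logistic model (first column is the bias term).
rng = np.random.default_rng(0)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_theta = np.array([0.5, 2.0, -1.0])
y = (rng.random(n) < sigmoid(X @ true_theta)).astype(float)

theta = np.zeros(3)
lr = 0.5  # assumed step size
initial_loss = bce(theta, X, y)
for _ in range(500):
    # Gradient of the mean BCE with respect to theta: X^T (sigmoid(X theta) - y) / n
    grad = X.T @ (sigmoid(X @ theta) - y) / n
    theta -= lr * grad
final_loss = bce(theta, X, y)
print(initial_loss, final_loss, theta)
```

The loss starts at log 2 (all probabilities equal 0.5) and decreases as θ moves toward the generating parameters.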
Regularization helps us control our model capacity, ensuring that our models are better at making (correct) classifications on data points that they were not trained on, which we call the ability to generalize. If we don't apply regularization, our classifiers can easily become too complex and overfit to our training data, in which case we lose the ability to generalize to our testing data (and to data points outside the testing set as well, such as new bean samples).
However, too much regularization can be a bad thing. We can run the risk of underfitting, in which case our model performs poorly on the training data and is not able to model the relationship between the input data and output class labels (because we limited model capacity too much).
reg_case1 = []
# loop over our set of regularizers
for r in (None, "l1", "l2"):
    # train an SGD classifier with logistic (log) loss and the specified
    # penalty for at most 100 epochs (training may stop earlier via tol)
    print("[INFO] training model with '{}' penalty".format(r))
    # note: newer scikit-learn releases name this loss "log_loss"
    model = SGDClassifier(loss="log", penalty=r, max_iter=100,
        learning_rate="constant", tol=1e-3, eta0=0.01, random_state=10)
    model.fit(x_train_c1, y_train_c1)
    # evaluate the classifier on the held-out test set
    acc = model.score(x_test_c1, y_test_c1)
    reg_case1.append(acc)
    print("[INFO] {} penalty accuracy: {:.2f}%".format(r,
        acc * 100))
[INFO] training model with 'None' penalty
[INFO] None penalty accuracy: 84.85%
[INFO] training model with 'l1' penalty
[INFO] l1 penalty accuracy: 84.88%
[INFO] training model with 'l2' penalty
[INFO] l2 penalty accuracy: 84.81%
reg_case2 = []
# loop over our set of regularizers
for r in (None, "l1", "l2"):
    # train an SGD classifier with logistic (log) loss and the specified
    # penalty for at most 100 epochs (training may stop earlier via tol)
    print("[INFO] training model with '{}' penalty".format(r))
    # note: newer scikit-learn releases name this loss "log_loss"
    model = SGDClassifier(loss="log", penalty=r, max_iter=100,
        learning_rate="constant", tol=1e-3, eta0=0.01, random_state=10)
    model.fit(x_train_c2, y_train_c2)
    # evaluate the classifier on the held-out test set
    acc = model.score(x_test_c2, y_test_c2)
    reg_case2.append(acc)
    print("[INFO] {} penalty accuracy: {:.2f}%".format(r,
        acc * 100))
[INFO] training model with 'None' penalty
[INFO] None penalty accuracy: 85.96%
[INFO] training model with 'l1' penalty
[INFO] l1 penalty accuracy: 86.03%
[INFO] training model with 'l2' penalty
[INFO] l2 penalty accuracy: 86.03%
index_reg = ["No penalty", "L1", "L2 (weight decay)"]
out_reg = pd.DataFrame(list(zip(reg_case1,reg_case2)),columns=[ "Case1","Case2"],index = [index_reg])
out_reg
| | Case1 | Case2 |
|---|---|---|
| No penalty | 0.848474 | 0.859559 |
| L1 | 0.848841 | 0.860294 |
| L2 (weight decay) | 0.848106 | 0.860294 |
Because the model neither underfits nor overfits, the results for both Case 1 and Case 2 after regularization are close to the results without regularization. Performance with the 90:10 split is slightly better than with the 80:20 split.
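Another possible reason the penalties change so little is regularization strength: SGDClassifier's default `alpha` is 0.0001, which is a weak penalty. The sketch below shows how the strength affects the fitted weights. It is an illustration on synthetic data using `LogisticRegression`, where strength is set through `C` (the inverse of the penalty weight, so a smaller `C` means a stronger L2 penalty), and it is not the SGD setup used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

coef_norms = {}
for C in (0.001, 1.0, 1000.0):
    # Default penalty is L2; C controls how strongly the weights are shrunk.
    clf = LogisticRegression(C=C, max_iter=2000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

print(coef_norms)
```

A strong penalty shrinks the weight vector noticeably, while a weak one leaves it close to the unregularized fit, which would explain near-identical accuracies.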
The Logistic Regression models built above produce the following predictions, which are used to evaluate the models.
y_pred_c1 = lr_model_c1.predict(x_test_c1)
y_pred_c2 = lr_model_c2.predict(x_test_c2)
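A confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. For a binary problem it is a table with 4 different combinations of predicted and actual values; for the 7 bean classes here it is a 7×7 table in which rows are the actual classes and columns are the predicted classes.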
It is extremely useful for measuring Recall, Precision, Specificity, Accuracy, and most importantly AUC-ROC curves.
dict_Class
{'DERMASON': 0,
'BOMBAY': 1,
'CALI': 2,
'BARBUNYA': 3,
'HOROZ': 4,
'SIRA': 5,
'SEKER': 6}
print("Confusion Matrix from Case1(80:20)")
confusion_matrix(y_test_c1,y_pred_c1)
Confusion Matrix from Case1(80:20)
array([[632, 0, 0, 0, 2, 46, 20],
[ 0, 7, 13, 6, 4, 57, 0],
[ 0, 2, 292, 35, 7, 2, 1],
[ 0, 1, 33, 210, 4, 5, 1],
[ 1, 0, 6, 1, 358, 7, 0],
[ 40, 9, 3, 3, 13, 457, 15],
[ 21, 2, 0, 0, 0, 14, 389]], dtype=int64)
print("Confusion Matrix from Case2(90:10)")
confusion_matrix(y_test_c2,y_pred_c2)
Confusion Matrix from Case2(90:10)
array([[315, 0, 0, 0, 1, 20, 10],
[ 0, 4, 8, 5, 1, 29, 0],
[ 0, 0, 150, 15, 2, 0, 1],
[ 0, 1, 18, 115, 3, 3, 0],
[ 0, 0, 2, 0, 174, 2, 0],
[ 13, 4, 2, 1, 10, 227, 5],
[ 10, 1, 0, 0, 0, 7, 201]], dtype=int64)
Accuracy: the base metric used for model evaluation is often accuracy, which is the number of correct predictions divided by the total number of predictions.
matrix_c1 = [
accuracy_score(y_test_c1,y_pred_c1),
precision_score(y_test_c1,y_pred_c1,average="weighted",),
recall_score(y_test_c1,y_pred_c1,average="weighted"),
f1_score(y_test_c1,y_pred_c1,average="weighted")
]
matrix_c2 = [
accuracy_score(y_test_c2,y_pred_c2),
precision_score(y_test_c2,y_pred_c2,average="weighted"),
recall_score(y_test_c2,y_pred_c2,average="weighted"),
f1_score(y_test_c2,y_pred_c2,average="weighted")
]
index_matrix = ["Accuracy","Precision","Recall","F1-Score"]
out_matrix = pd.DataFrame(list(zip(matrix_c1,matrix_c2)),columns=[ "Case1","Case2"],index = [index_matrix])
out_matrix
| | Case1 | Case2 |
|---|---|---|
| Accuracy | 0.862449 | 0.872059 |
| Precision | 0.850961 | 0.861143 |
| Recall | 0.862449 | 0.872059 |
| F1-Score | 0.853872 | 0.862379 |
The comparison of accuracy, precision, recall and F1-score shows that the two cases are comparable and that the model performs well overall, with an accuracy of about 86% in Case 1 and 87% in Case 2. We cannot rely on accuracy alone, as it can be biased when classes differ in size, so we also calculate precision (where false positives matter) and recall (where false negatives matter), and finally the F1-score (which takes both precision and recall into account) for an effective performance evaluation.
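The weighted scores in the table can be recomputed by hand from the Case 1 confusion matrix printed earlier, which also exposes the per-class values that the weighted averages hide. The sketch below copies that matrix; the variable names are illustrative.

```python
import numpy as np

# Case 1 (80:20) confusion matrix: rows are actual classes, columns are predictions.
cm = np.array([
    [632,   0,   0,   0,   2,  46,  20],
    [  0,   7,  13,   6,   4,  57,   0],
    [  0,   2, 292,  35,   7,   2,   1],
    [  0,   1,  33, 210,   4,   5,   1],
    [  1,   0,   6,   1, 358,   7,   0],
    [ 40,   9,   3,   3,  13, 457,  15],
    [ 21,   2,   0,   0,   0,  14, 389],
])

correct = np.diag(cm)
support = cm.sum(axis=1)      # actual samples per class
predicted = cm.sum(axis=0)    # predicted samples per class

recall = correct / support
precision = correct / predicted
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

# "weighted" averaging weights each class by its share of the actual samples.
weights = support / support.sum()
weighted_precision = np.sum(weights * precision)
weighted_recall = np.sum(weights * recall)
weighted_f1 = np.sum(weights * f1)

print(np.round(recall, 3))
print(accuracy, weighted_precision, weighted_recall, weighted_f1)
```

Weighted recall always equals accuracy, which is why those two rows of the table match. The per-class recall also shows that class 1 (BOMBAY) is recognized in only 7 of its 87 test samples, something the aggregate scores do not reveal.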
We performed multi-class classification on the dry beans dataset. To achieve this, the steps below were followed: